AMD's Next Generation Microprocessor Architecture
Fred Weber
October 2001
2 "Hammer" Goals
- Build a next-generation system architecture which serves as the foundation for future processor platforms
- Enable a full line of server and workstation products
  - Leading-edge x86 (32-bit) performance and compatibility
  - Native 64-bit support
  - Establish x86-64 instruction set architecture
  - Extensive multiprocessor support
  - RAS features
- Provide top-to-bottom desktop and mobile processors
3 Agenda
- x86-64™ technology
- "Hammer" architecture
- "Hammer" system architecture

x86-64™ Technology
5 Why 64-Bit Computing?
- Required for large memory programs
  - Large databases
  - Scientific and engineering problems
  - Designing CPUs :-)
- But,
  - Limited demand for applications which require 64 bits
  - Most applications can remain 32-bit x86 instructions, if the processor continues to deliver leading-edge x86 performance
- And,
  - Software is a huge investment (tool chains, applications, certifications)
  - Instruction set is first and foremost a vehicle for compatibility
    - Binary compatibility
    - Interpreter/JIT support is increasingly important
6 x86-64 Instruction Set Architecture
- x86-64 mode built on x86
  - Similar to the previous extension from 16-bit to 32-bit
  - Vast majority of opcodes and features unchanged
  - Integer/address register files and datapaths are native 64-bit
  - 48-bit virtual address space, 40-bit physical address space
- Enhancements
  - Add 8 new integer registers
  - Add PC-relative addressing
  - Add full support for SSE/SSE2-based floating-point application binary interface (ABI)
    - Including 16 registers
  - Additional registers and data size added through reclaim of one-byte increment/decrement opcodes (0x40-0x4F) for use as a single optional prefix
- Public specification
  - www.x86-64.org
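The single optional prefix built from the reclaimed 0x40-0x4F opcodes can be illustrated with a small decoder. The bit layout used here (0100WRXB) is taken from the published x86-64 specification rather than from this slide, and the class and function names are hypothetical. Treat it as a minimal sketch, not a full instruction decoder.

```python
from typing import NamedTuple, Optional


class RexPrefix(NamedTuple):
    """Fields of the optional prefix byte (layout 0100WRXB, per the public spec)."""
    w: bool  # 1 selects 64-bit operand size
    r: bool  # extends the ModRM.reg field (reaches r8-r15 / xmm8-xmm15)
    x: bool  # extends the SIB.index field
    b: bool  # extends ModRM.rm, SIB.base, or the opcode register field


def decode_rex(byte: int) -> Optional[RexPrefix]:
    """Return the decoded prefix, or None if the byte is outside 0x40-0x4F."""
    if not 0x40 <= byte <= 0x4F:
        return None
    return RexPrefix(
        w=bool(byte & 0x08),
        r=bool(byte & 0x04),
        x=bool(byte & 0x02),
        b=bool(byte & 0x01),
    )


def extended_reg(field3: int, ext_bit: bool) -> int:
    """Combine a 3-bit register field with its extension bit (result 0-15)."""
    return (int(ext_bit) << 3) | (field3 & 0x7)


if __name__ == "__main__":
    rex = decode_rex(0x4C)  # binary 0100 1100: W and R set
    print(rex)
    print(extended_reg(0b001, rex.r))  # register number 9
```

One extension bit per register field is what doubles the integer and SSE register counts from 8 to 16 without changing the existing 3-bit encodings.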
7 x86-64 Programmer's Model
[Figure: register diagram. In x86: 32-bit general-purpose registers EAX through EDI (with AH/AL sub-registers), EIP program counter (bits 31-0), x87 registers (bits 79-0), and SSE & SSE2 registers XMM0-XMM7 (bits 127-0). Added by x86-64: general-purpose registers widened to 64 bits (RAX, bits 63-0), new registers R8-R15, and XMM8-XMM15.]
8 x86-64 Code Generation and Quality
- Compiler and tool chain is a straightforward port
- Instruction set is designed to offer all the advantages of CISC and RISC
  - Code density of CISC
  - Register usage and ABI models of RISC
  - Enables easy application of standard compiler optimizations
- SPECint2000 code generation (compared to 32-bit x86)
  - Code size grows <10%
    - Due mostly to instruction prefixes
  - Static instruction count shrinks by 10%
  - Dynamic instruction count shrinks by at least 5%
  - Dynamic load/store count shrinks by 20%
  - All without any specific code optimizations
9 x86-64™ Summary
- Processor is fully x86 capable
  - Full native performance with 32-bit applications and OS
  - Full compatibility (BIOS, OS, drivers)
- Flexible deployment
  - Best-in-class 32-bit x86 performance
  - Excellent 64-bit x86-64 instruction execution when needed
- Server, workstation, desktop, and mobile share same architecture
  - OS, drivers and applications can be the same
  - CPU vendor's focus not split, ISV focus not split
  - Support, optimization, etc. all designed to be the same

The "Hammer" Architecture
11 The "Hammer" Architecture
[Figure: block diagram. Labels: "Hammer" processor core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, HyperTransport™ links.]
12-14 Processor Core Overview
[Figure, repeated on three consecutive slides: block diagram of the processor core. Labels:
- Front end: fetch, 2-transit, pick; three decode lanes (decode 1, decode 2, pack, decode); 2K branch targets, 16K history counter, RAS & target address
- Execution: three 8-entry schedulers with three AGU/ALU pairs; one 36-entry scheduler with FADD, FMUL, and FMISC units
- Level 1: instruction cache with instruction TLB; data cache with ECC and data TLB
- Level 2: L2 cache with L2 ECC, L2 tags, and L2 tag ECC
- System interface: system request queue (SRQ), crossbar (XBAR), memory controller & HyperTransport™]
15 "Hammer" Pipeline
[Figure: pipeline timeline. Fetch: stages 1-7; Exec: stages 8-12; L2: stages 13-19; DRAM: stages 20-32.]
16 Fetch/Decode Pipeline
[Figure: the fetch portion of the pipeline timeline, expanded into its stages: Fetch 1, Fetch 2, Pick, Decode 1, Decode 2, Pack, Pack/Decode.]
17 Execute Pipeline
[Figure: the exec portion of the pipeline timeline, expanded into its stages: Dispatch, Schedule, AGU/ALU, Data Cache 1, Data Cache 2. Timing label on the figure: 1 ns.]
18 L2 Pipeline
[Figure: the L2 portion of the pipeline timeline, expanded into its steps: L2 request; address to L2 tag; L2 tag; L2 tag, L2 data; L2 data; data from L2; data to DC mux; write L1, forward. Timing labels on the figure: 1 ns, 5 ns.]
19 DRAM Pipeline
[Figure: the DRAM portion of the pipeline timeline, expanded into its steps: address to NB; clock boundary; SRQ load; SRQ schedule; GART/AddrMap CAM; GART/AddrMap RAM; XBAR; coherence/order check; MCT schedule; DRAM cmd Q load; DRAM page status check; DRAM cmd Q schedule; request to DRAM pins; DRAM access; pins to MCT; through NB; clock boundary; across CPU; ECC and mux; write DC. The L2 steps from the previous slide are shown alongside. Timing labels on the figure: 1 ns, 5 ns, 12 ns.]
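The stage numbering on the pipeline slides can be cross-checked with a few lines of arithmetic. The sketch below assumes the OCR'd figure labels read as Fetch 1-7, Exec 8-12, L2 13-19, and DRAM 20-32; that assignment of numbers to segments is an interpretation of the flattened figure text, and all names are hypothetical.

```python
# Inclusive stage ranges, as read from the OCR'd pipeline figures (an assumption).
PIPELINE_SEGMENTS = {
    "fetch": (1, 7),
    "exec": (8, 12),
    "l2": (13, 19),
    "dram": (20, 32),
}

# Stage names listed on the fetch/decode and execute slides.
FETCH_STAGES = ["Fetch 1", "Fetch 2", "Pick", "Decode 1", "Decode 2",
                "Pack", "Pack/Decode"]
EXEC_STAGES = ["Dispatch", "Schedule", "AGU/ALU", "Data Cache 1", "Data Cache 2"]


def segment_depth(name: str) -> int:
    """Number of stages in one segment of the timeline."""
    first, last = PIPELINE_SEGMENTS[name]
    return last - first + 1


def last_stage(name: str) -> int:
    """Stage number at which the given segment completes."""
    return PIPELINE_SEGMENTS[name][1]


if __name__ == "__main__":
    for name in PIPELINE_SEGMENTS:
        print(name, segment_depth(name), "stages, ends at stage", last_stage(name))
```

Under this reading, the seven named fetch/decode stages and five named execute stages match the 1-7 and 8-12 ranges exactly, which gives some confidence in the interpretation.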
20 Large Workload Branch Prediction
[Figure: branch prediction block diagram.
- Fetch types in the legend: sequential fetch, predicted fetch, branch target address calculator fetch, mispredicted fetch
- Structures: global history counter (16K, 2-bit counters); target array (2K targets); 12-entry return address stack (RAS); branch target address calculator (BTAC); branch selectors
- Other labels: L2 cache, evicted data, branch selectors, execution stages]
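A table of 16K 2-bit counters, as named on this slide, is commonly operated as a set of saturating counters. The sketch below shows that general technique only. The slide does not say how the table is indexed, so the XOR of branch address and global history used here is an assumption, as are the class and method names and the initial counter value.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 predict taken)."""

    def __init__(self, entries: int = 16 * 1024) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries  # start at "weakly not taken" (assumption)

    def _index(self, pc: int, history: int) -> int:
        # Assumption: XOR-style indexing; the slide does not specify the hash.
        return (pc ^ history) & self.mask

    def predict(self, pc: int, history: int) -> bool:
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(pc, history)] >= 2

    def update(self, pc: int, history: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc, history)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    table = TwoBitCounterTable()
    for outcome in (True, True, False):
        table.update(0x400, 0, outcome)
    print(table.predict(0x400, 0))
```

The two-bit width means a single contrary outcome does not flip a strongly biased prediction, which is the usual reason for preferring it over a one-bit scheme.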
21 Large Workload TLBs
[Figure: TLB block diagram.
- L1 instruction TLB: 40-entry, fully associative, 4M/2M & 4K pages
- L2 instruction TLB: 512-entry, 4-way associative
- Port 0 L1 data TLB: 40-entry, fully associative, 4M/2M & 4K pages
- Port 1 L1 data TLB: 40-entry, fully associative, 4M/2M & 4K pages
- L2 data TLB: 512-entry, 4-way associative
- 24-entry page descriptor cache (PDP, PDE)
- Flush filter CAM: 32-entry (CR3, PDP, PDE)
- TLB entry fields: ASN, VA, PA; current ASN register
- Other labels: L2 data cache, table walk, TLB reload, PDC reload, probe, modify]
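The TLB sizes on this slide translate directly into set counts and address reach. The sketch below works through that arithmetic for 4K pages. Using the low bits of the virtual page number as the set index is an assumption (the slide does not describe the indexing), and the function names are hypothetical.

```python
PAGE_SHIFT_4K = 12            # 4 KiB pages
L1_TLB_ENTRIES = 40           # fully associative, from the slide
L2_TLB_ENTRIES = 512          # from the slide
L2_TLB_WAYS = 4               # from the slide


def num_sets(entries: int, ways: int) -> int:
    """Sets in a set-associative structure."""
    if entries % ways:
        raise ValueError("entries must be a multiple of ways")
    return entries // ways


def set_index(virtual_address: int, sets: int, page_shift: int = PAGE_SHIFT_4K) -> int:
    """Set selected by a virtual address (assumes low VPN bits pick the set)."""
    return (virtual_address >> page_shift) % sets


def tlb_reach_bytes(entries: int, page_shift: int = PAGE_SHIFT_4K) -> int:
    """Memory covered when every entry maps a distinct page of this size."""
    return entries << page_shift


if __name__ == "__main__":
    sets = num_sets(L2_TLB_ENTRIES, L2_TLB_WAYS)
    print("L2 TLB sets:", sets)
    print("L1 reach (KiB):", tlb_reach_bytes(L1_TLB_ENTRIES) // 1024)
    print("L2 reach (KiB):", tlb_reach_bytes(L2_TLB_ENTRIES) // 1024)
```

With 4K pages, the 40-entry L1 TLB covers 160 KiB and the 512-entry L2 TLB covers 2 MiB, which shows why the second level matters for large working sets.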
22 DDR Memory Controller
- Integrated memory controller details
  - 8- or 16-byte interface
  - 16-byte interface supports
    - Direct connection to 8 registered DIMMs
    - Chipkill ECC
  - Unbuffered or registered DIMMs
  - PC1600, PC2100, and PC2700 DDR memory
- Integrated memory controller benefits
  - Significantly reduces DRAM latency
  - Memory latency improves as CPU and HyperTransport™ link speed improves
  - Bandwidth and capacity grow with number of CPUs
  - Snoop probe throughput scales with CPU frequency
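Peak bandwidth for the listed memory types follows from transfer rate times interface width. The sketch below assumes the standard DDR transfer rates behind the module names (DDR200, DDR266, DDR333), which are not stated on the slide, and it computes theoretical peaks only, not sustained bandwidth. Names are hypothetical.

```python
# Transfers per second for each module class (standard DDR rates; an
# assumption added here, not stated on the slide).
DDR_TRANSFERS_PER_SEC = {
    "PC1600": 200.00e6,
    "PC2100": 266.67e6,
    "PC2700": 333.33e6,
}


def peak_bandwidth_gb_s(module: str, interface_bytes: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if interface_bytes not in (8, 16):
        raise ValueError("the slide lists only 8- or 16-byte interfaces")
    return DDR_TRANSFERS_PER_SEC[module] * interface_bytes / 1e9


if __name__ == "__main__":
    for module in DDR_TRANSFERS_PER_SEC:
        for width in (8, 16):
            print(f"{module} x {width} bytes: "
                  f"{peak_bandwidth_gb_s(module, width):.2f} GB/s")
```

The 8-byte results reproduce the numbers in the module names (about 1.6, 2.1, and 2.7 GB/s), and the 16-byte interface doubles each of them.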
23 Reliability and Availability
- L1 data cache ECC protected
- L2 cache and cache tags ECC protected
- DRAM ECC protected
  - With Chipkill ECC support
- On-chip and off-chip ECC-protected arrays include background hardware scrubbers
- Remaining arrays parity protected
  - L1 instruction cache, TLBs, tags
  - Generally read-only data which can be recovered
- Machine check architecture
  - Report failures and predictive failure results
  - Mechanism for hardware/software error containment and recovery
24 HyperTransport™ Technology
- Next-generation computing performance goes beyond the microprocessor
- Screaming I/O for chip-to-chip communication
  - High bandwidth
  - Reduced pin count
  - Point-to-point links
  - Split transaction and full duplex
- Open standard
  - Industry enabler for building high-bandwidth I/O subsystems
  - I/O subsystems: PCI-X, G-bit Ethernet, InfiniBand, etc.
- Strong industry acceptance
  - 100+ companies evaluating specification & several licensing technologies through AMD (2000)
  - First HyperTransport technology-based south bridge announced by NVIDIA (June 2001)
- Enables scalable 2-8 processor SMP systems
  - Glueless MP
25 CPU with Integrated Northbridge
[Figure: four CPUs, each with an integrated northbridge containing SRQ, XBAR, MCT, and three HyperTransport ports, one of them with a host bridge. Each CPU has attached DRAM; I/O attaches over HyperTransport™ links, and the CPUs are joined by coherent HyperTransport. Legend: HT* = HyperTransport™ technology, HB = host bridge.]
26 Northbridge Overview
[Figure: northbridge block diagram.
- Blocks: system request queue (SRQ), Advanced Programmable Interrupt Controller (APIC), crossbar (XBAR), memory controller (MCT), DRAM controller (DCT)
- CPU-side interfaces: CPU 0 and CPU 1 data, probes, requests, and interrupts
- External interfaces: HyperTransport™ links 0, 1, and 2; DRAM data; RAS/CAS/control
- Legend: 64-bit data, 64-bit command/address, 16-bit data/command/address]
27 Northbridge Command Flow
[Figure: command path through the northbridge. All buffers are 64-bit command/address.
- Inputs: CPU 0, CPU 1, HyperTransport™ link 0/1/2 inputs
- CPU-side buffers: victim buffer (8-entry), write buffer (4-entry), instruction MAB (2-entry), data MAB (8-entry)
- Queues: system request queue (24-entry), memory command queue (20-entry), address map & GART
- XBAR routers with buffers of 10, 16, 16, 16, and 12 entries
- Outputs: to CPU, to DCT, HyperTransport link 0/1/2 outputs]
28 Northbridge Data Flow
[Figure: data path through the northbridge. All buffers are 64-byte cache lines.
- Inputs: CPU 0, CPU 1, HyperTransport™ link 0/1/2 inputs, from host bridge, from DCT
- CPU-side buffers: victim buffer (8-entry), write buffer (4-entry)
- Queues: system request data queue (12-entry), memory data queue (8-entry)
- XBAR buffers of 5, 8, 8, 8, and 8 entries
- Outputs: to CPU, to host bridge, to DCT, HyperTransport link 0/1/2 outputs]
29-37 Coherent HyperTransport™ Read Request
[Figure sequence, steps 1-9: four CPUs (CPU 0 to CPU 3), each with local memory and I/O, service a "read cache line" request. Each slide keeps the link labels from earlier steps and adds new ones. Callouts and newly added labels per step:
- Step 1: callout "read cache line"; no link labels yet
- Step 2: callout "read cache line"; adds 1: RdBlk
- Step 3: callouts "probe request 0", "probe request 2", "probe request 3"; adds 2: RdBlk
- Step 4: callouts "probe response 3", "probe request 1"; adds 3: PRQ0, 3: PRQ2, 3: PRQ3, 3: RdBlk
- Step 5: callouts "probe response 0", "probe response 3", "read response"; adds 4: TRSP3, 4: PRQ1
- Step 6: callouts "probe response 2", "read response"; adds 5: RdRsp, 5: TRSP3, 5: TRSP0
- Step 7: callout "read response"; adds 6: RdRsp, 6: TRSP2
- Step 8: callout "source done"; adds 7: RdRsp
- Step 9: callout "source done"; adds 9: SrcDn]
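The read-request sequence can be summarized as a list of logical messages. The sketch below is a loose model under stated assumptions: it reads the figure labels as RdBlk (read block request), PRQ (probe request), TRSP (probe response), RdRsp (read response), and SrcDn (source done); it treats the probe as a broadcast from the node that owns the memory; and it ignores multi-hop forwarding and per-step timing. The node numbers, function names, and the choice of requester and home node are hypothetical.

```python
from typing import List, Tuple

Message = Tuple[str, int, int]  # (command, source node, destination node)


def coherent_read(requester: int, home: int, nodes: List[int]) -> List[Message]:
    """Logical messages for one cache-line read in a broadcast-probe scheme."""
    if requester not in nodes or home not in nodes:
        raise ValueError("requester and home must be members of nodes")
    messages: List[Message] = [("RdBlk", requester, home)]
    # The home node probes every node, including itself and the requester.
    messages += [("PRQ", home, node) for node in nodes]
    # The home node's memory returns the data to the requester.
    messages.append(("RdRsp", home, requester))
    # Every probed node answers the requester directly (the requester's own
    # answer never crosses a link, so a figure would not show it).
    messages += [("TRSP", node, requester) for node in nodes]
    # The requester tells the home node that the transaction is complete.
    messages.append(("SrcDn", requester, home))
    return messages


def responses_expected(nodes: List[int]) -> int:
    """Responses the requester collects: one per probed node plus the data."""
    return len(nodes) + 1


if __name__ == "__main__":
    for message in coherent_read(requester=1, home=2, nodes=[0, 1, 2, 3]):
        print(message)
```

The point of the model is that the requester, not the home node, gathers the probe responses, so data and coherence answers can arrive in parallel.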
38 "Hammer" Architecture Summary
- 8th-generation microprocessor core
  - Improved IPC and operating frequency
  - Support for large workloads
- Cache subsystem
  - Enhanced TLB structures
  - Improved branch prediction
- Integrated DDR memory controller
  - Reduced DRAM latency
- HyperTransport™ technology
  - Screaming I/O for chip-to-chip communication
  - Enables glueless MP

"Hammer" System Architecture
40 "Hammer" System Architecture: 1-Way
[Figure: one "Hammer" processor. Labels: 8x AGP, HyperTransport™ AGP, int gfx, southbridge.]
41 "Hammer" System Architecture, Glueless Multiprocessing: 2-Way
[Figure: two "Hammer" processors. Labels: 8x AGP, HyperTransport™ AGP, HyperTransport PCI-X, southbridge.]
42 "Hammer" System Architecture, Glueless Multiprocessing: 4-Way
[Figure: four "Hammer" processors. Labels: 8x AGP, HyperTransport™ AGP (AGP optional), HyperTransport PCI-X (shown twice), southbridge.]
43 "Hammer" System Architecture, Glueless Multiprocessing: 8-Way
[Figure: eight "Hammer" processors connected to one another.]
44 MP System Architecture
- Software view of memory is SMP
  - Physical address space is flat and fully coherent
  - Latency difference between local and remote memory in an 8P system is comparable to the difference between a DRAM page hit and a DRAM page conflict
  - DRAM location can be contiguous or interleaved
- Multiprocessor support designed in from the beginning
  - Lower overall chip count
  - All MP system functions use CPU technology and frequency
- 8P system parameters
  - 64 DIMMs (up to 128 GB) directly connected
  - 4 HyperTransport links available for I/O (25 GB/s)
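The 8P system parameters can be broken down per processor with simple arithmetic. The sketch below uses only figures quoted in this presentation (8 CPUs, 64 DIMMs, 128 GB, 4 I/O links at 25 GB/s total, 40-bit physical addressing). Treating 1 GB as 2^30 bytes for the address-space check is an assumption, as are the variable names.

```python
CPUS = 8
DIMMS_TOTAL = 64
MEMORY_TOTAL_GB = 128
IO_LINKS = 4
IO_BANDWIDTH_TOTAL_GB_S = 25.0
PHYSICAL_ADDRESS_BITS = 40  # from the instruction set architecture slide

dimms_per_cpu = DIMMS_TOTAL // CPUS                      # DIMMs per memory controller
gb_per_dimm = MEMORY_TOTAL_GB / DIMMS_TOTAL              # implied maximum DIMM size
gb_per_cpu = MEMORY_TOTAL_GB / CPUS                      # local memory per processor
io_gb_s_per_link = IO_BANDWIDTH_TOTAL_GB_S / IO_LINKS    # I/O bandwidth per link

# Assumption: 1 GB = 2**30 bytes when comparing against the address space.
addressable_gb = 2 ** PHYSICAL_ADDRESS_BITS // 2 ** 30
fits_in_physical_space = MEMORY_TOTAL_GB <= addressable_gb

if __name__ == "__main__":
    print("DIMMs per CPU:", dimms_per_cpu)
    print("GB per DIMM:", gb_per_dimm)
    print("GB per CPU:", gb_per_cpu)
    print("I/O GB/s per link:", io_gb_s_per_link)
    print("Addressable GB:", addressable_gb, "fits:", fits_in_physical_space)
```

Eight DIMMs per processor agrees with the "direct connection to 8 registered DIMMs" figure given for the 16-byte memory interface.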
45 The Rewards of Good Plumbing
- Bandwidth
  - 4P system designed to achieve 8 GB/s aggregate memory copy bandwidth
    - With data spread throughout system
  - Leading-edge bus-based systems limited to about 2.1 GB/s aggregate bandwidth (3.2 GB/s theoretical peak)
- Latency
  - Average unloaded latency in 4P system (page miss) is designed to be 140 ns
  - Average unloaded latency in 8P system (page miss) is designed to be 160 ns
  - Latency under load planned to increase much more slowly than bus-based systems due to available bandwidth
  - Latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed
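The bandwidth and latency targets on this slide can be put side by side with a few ratios. The sketch below uses only the numbers quoted above. The clock frequency passed to `latency_in_cycles` is a hypothetical value chosen for illustration, since the presentation does not state one.

```python
HAMMER_4P_COPY_GB_S = 8.0     # design target, aggregate memory copy
BUS_ACHIEVED_GB_S = 2.1       # bus-based systems, achieved
BUS_PEAK_GB_S = 3.2           # bus-based systems, theoretical peak
LATENCY_4P_NS = 140.0         # average unloaded, page miss
LATENCY_8P_NS = 160.0         # average unloaded, page miss

bandwidth_ratio = HAMMER_4P_COPY_GB_S / BUS_ACHIEVED_GB_S
bus_efficiency = BUS_ACHIEVED_GB_S / BUS_PEAK_GB_S
latency_growth_4p_to_8p = LATENCY_8P_NS / LATENCY_4P_NS - 1.0


def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency to CPU cycles for a given (hypothetical) clock."""
    return latency_ns * clock_ghz


if __name__ == "__main__":
    print(f"4P target vs bus-based achieved: {bandwidth_ratio:.1f}x")
    print(f"Bus achieved/peak: {bus_efficiency:.0%}")
    print(f"Latency growth, 4P to 8P: {latency_growth_4p_to_8p:.0%}")
    print("140 ns at a hypothetical 2.0 GHz:",
          latency_in_cycles(LATENCY_4P_NS, 2.0), "cycles")
```

Doubling the processor count from 4 to 8 raises the quoted unloaded latency by roughly one seventh, which is the scaling behavior the slide is emphasizing.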
46 "Hammer" Summary
- 8th-generation CPU core
  - Delivering high performance through an optimum balance of IPC and operating frequency
- x86-64™ technology
  - Compelling 64-bit migration strategy without any significant sacrifice of existing code base
  - Full-speed support for x86 code base
  - Unified architecture from notebook through server
- DDR memory controller
  - Significantly reduces DRAM latency
- HyperTransport™ technology
  - High-bandwidth I/O
  - Glueless MP
- Foundation for future portfolio of processors
  - Top-to-bottom desktop and mobile processors
  - High-performance 1-, 2-, 4-, and 8-way servers and workstations
47 ©2001 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, 3DNow! and combinations thereof are trademarks of Advanced Micro Devices. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other product names are for informational purposes only and may be trademarks of their respective companies.
